9 research outputs found
Strategies for improving low resource speech to text translation relying on pre-trained ASR models
This paper presents techniques and findings for improving the performance of
low-resource speech to text translation (ST). We conducted experiments on both
simulated and real-low resource setups, on language pairs English - Portuguese,
and Tamasheq - French respectively. Using the encoder-decoder framework for ST,
our results show that a multilingual automatic speech recognition system acts
as a good initialization under low-resource scenarios. Furthermore, using the
CTC as an additional objective for translation during training and decoding
helps to reorder the internal representations and improves the final
translation. Through our experiments, we try to identify various factors
(initializations, objectives, and hyper-parameters) that contribute the most
for improvements in low-resource setups. With only 300 hours of pre-training
data, our model achieved 7.3 BLEU score on Tamasheq - French data,
outperforming prior published works from IWSLT 2022 by 1.6 points
An Empirical Evaluation of Zero Resource Acoustic Unit Discovery
Acoustic unit discovery (AUD) is a process of automatically identifying a
categorical acoustic unit inventory from speech and producing corresponding
acoustic unit tokenizations. AUD provides an important avenue for unsupervised
acoustic model training in a zero resource setting where expert-provided
linguistic knowledge and transcribed speech are unavailable. Therefore, to
further facilitate zero-resource AUD process, in this paper, we demonstrate
acoustic feature representations can be significantly improved by (i)
performing linear discriminant analysis (LDA) in an unsupervised self-trained
fashion, and (ii) leveraging resources of other languages through building a
multilingual bottleneck (BN) feature extractor to give effective cross-lingual
generalization. Moreover, we perform comprehensive evaluations of AUD efficacy
on multiple downstream speech applications, and their correlated performance
suggests that AUD evaluations are feasible using different alternative language
resources when only a subset of these evaluation resources can be available in
typical zero resource applications.Comment: 5 pages, 1 figure; Accepted for publication at ICASSP 201
Zero-Time Windowing Cepstral Coefficients for Dialect Classification
In this paper, we propose to use novel acoustic features, namely zero-time windowing cepstral coefficients (ZTWCC) for dialect classification. ZTWCC features are derived from high resolution spectrum obtained with zero-time windowing (ZTW) method, and were shown to be useful for discriminating speech sound characteristics effectively as compared to a DFT spectrum. Our proposed system is based on i-vectors trained on static and shifted delta coefficients of ZTWCC. The i-vectors are further whitened before classification. The proposed system is compared with i-vector baseline system trained on Mel frequency cepstral coefficient (MFCC) features. Classification results on STYRIALECT database (German) and UT-Podcast (English) database revealed that the system with proposed features outperformed aforementioned baseline system. Our detailed experimental analysis on dialect classification shows that the i-vector system can indeed exploit high spectral resolution of ZTWCC and hence performed better than MFCC features based system.Peer reviewe
Automatic Processing Pipeline for Collecting and Annotating Air-Traffic Voice Communication Data
This document describes our pipeline for automatic processing of ATCO pilot audio communication we developed as part of the ATCO2 project. So far, we collected two thousand hours of audio recordings that we either preprocessed for the transcribers or used for semi-supervised training. Both methods of using the collected data can further improve our pipeline by retraining our models. The proposed automatic processing pipeline is a cascade of many standalone components: (a) segmentation, (b) volume control, (c) signal-to-noise ratio filtering, (d) diarization, (e) ‘speech-to-text’ (ASR) module, (f) English language detection, (g) call-sign code recognition, (h) ATCO—pilot classification and (i) highlighting commands and values. The key component of the pipeline is a speech-to-text transcription system that has to be trained with real-world ATC data; otherwise, the performance is poor. In order to further improve speech-to-text performance, we apply both semi-supervised training with our recordings and the contextual adaptation that uses a list of plausible callsigns from surveillance data as auxiliary information. Downstream NLP/NLU tasks are important from an application point of view. These application tasks need accurate models operating on top of the real speech-to-text output; thus, there is a need for more data too. Creating ATC data is the main aspiration of the ATCO2 project. At the end of the project, the data will be packaged and distributed by ELDA